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Abstract 

1 Background: High throughput data are complex and methods that reveal structure underlying the data are 
most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular 
technique in this respect. Nowadays often the challenge is to reveal structure in several sources of information (e. 
g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous 
component methods are most promising in this respect. However, the interpretation of the principal and 
simultaneous components is often daunting because contributions of each of the biomolecules (transcripts, 
proteins) have to be taken into account. 

2 Results: We propose a sparse simultaneous component method that makes many of the parameters redundant 
by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and 
ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different 
ways for the block structure present in the integrated data. This yields known sparse approaches as the lasso, the 
ridge penalty, the elastic net, the group lasso, sparse group lasso, and elitist lasso. In addition, the algorithmic 
results can be easily transposed to the context of regression. Metabolomics data obtained with two measurement 
platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the 
properties of different penalties with respect to sparseness across and within data blocks. 

3 Conclusion: Sparse simultaneous component analysis is a useful method for data integration: First, simultaneous 
analyses of multiple blocks offer advantages over sequential and separate analyses and second, interpretation of 
the results is highly facilitated by their sparseness. The approach offered is flexible and allows to take the block 
structure in different ways into account. As such, structures can be found that are exclusively tied to one data 
platform (group lasso approach) as well as structures that involve all data platforms (Elitist lasso approach). 

4 Availability: The additional file contains a MATLAB implementation of the sparse simultaneous component 
method. 



Background 

The integrated analysis of multiple data sets obtained 
for the same biological entities under study, has become 
one of the major challenges for data analysis in bioinfor- 
matics and computational biology. Two main causes for 
this trend are the availability of complementary mea- 
surement platforms and the systemic approach to biol- 
ogy; in both cases, multiple data sets are obtained on 
the same set of samples (e.g., culture samples, tissues). 
First, examples where several measurement platforms 
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are included are the study of the metabolome composi- 
tion of Escherichia coli (E. coli) using several analytical 
chemical methods to screen for metabolites [1] and the 
combination of cDNA and Affymetrix chips applied to 
sixty cancer cell lines [2]. In both examples, there is 
overlap in the metabolites or genes screened but also 
complementarity. Second, the modern systemic 
approach to biology leads to a probing of the biological 
system on different levels in the cellular organization, 
such as for example the transcript, protein, and metabo- 
lite level [3]. These approaches lead to situations where 
several data blocks are obtained that are coupled in the 
sense that they were obtained for the same set of sam- 
ples. A key issue in integrative data analysis is to analyze 
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such data simultaneously instead of separately or 
sequentially as this yields an aggregated view. In this 
respect, simultaneous component methods, that are an 
extension of principal component analysis (PCA) to the 
case of multiple coupled data blocks, were proposed and 
successfully used [4-7]. 

However, a drawback of component based methods 
like PCA is their lack of sparseness: Processes underlying 
the data are revealed by a weighted combination of all 
variables (these are the genes, transcripts, proteins, meta- 
bolites in the aforementioned examples). From an inter- 
pretational point of view, this is not very attractive and it 
also does not reflect that biological processes are 
expected to be governed by a limited number of genes 
[8]. The problem holds even more for simultaneous com- 
ponent methods as these involve multiple large sets of 
variables. To deal with this issue, sparse approaches have 
been proposed mainly within the context of regression 
analysis (e.g., [9,10]) but also for principal component 
analysis [8,11-14]: These select a limited number of vari- 
ables by shrinking many of the weights to zero which is 
accomplished by proper penalization of these (regression) 
weights. A favorable characteristic of such penalty based 
methods is that the selection is built-in (in contrast to, 
for example, first filtering and then doing the regression/ 
PCA). Here, we extent sparse PCA to sparse simulta- 
neous component methods, accounting for the fact that 
the data are structured in several data blocks holding 
both shared and complementary information. The esti- 
mation procedure used is efficient and the associated 
MATLAB code can be found in the additional file. 

First, we present the sparse simultaneous component 
model, starting from ordinary principal component ana- 
lysis and sparse PCA. A generic modeling framework is 
introduced that incorporates several types of penalties. 
Then we present some results for metabolomics data 
obtained with two measurement platforms for the same 
set of E. coli samples and we validate the method by 
means of simulated data. 

Results 

Algorithm 
Notation 

We will make use of the following formal notation: 
matrices are denoted by bold uppercases, vectors by 
bold lower case, the transpose by the superscript T , and 
the cardinality by the capital of the letter used to run 
the index (e.g., this paper deals with K data matrices 
with k running from 1 to K), see [15]. 

Throughout the paper, we suppose that all variables 
are mean-centered and scaled to norm one. 
Model 

Simultaneous component analysis is an extension of 
principal component analysis (PCA) to the case of 



multiple coupled data matrices. Consider the PCA of a 
single data block X k containing the scores of l k objects 
(e.g., batches, arrays) on J k variables (e.g., metabolites, 
genes). In a first model formulation [16] based on com- 
ponent scores, PCA decomposes the data as follows, 

X fe = T fe P[ + E fe (1) 

with the component scores of the l k objects on the 
R components, (of size J k x R) the loadings, and E k 
(of size I k x J k ) the matrix of residuals. To identify the 
model, usually the constraints are imposed that the axes 
have a principal axes orientation and that the compo- 
nent scores are orthogonal: T^Tj, = I. Another formula- 
tion of the PCA model is based on component weights 
as follows [17], 

X fe = XfeWfeP^ + Efe (2) 

with Wk (of size J k x R) the component weights. Note 
that we can write T k = X^W* resulting in the equiva- 
lence of models (1) and (2). However, usually (2) is con- 
strained to have orthogonal weights: W^Wfe = I. Note 
that under a least squares approach to PCA, = 
and thus also PfePfe = I- The principal components are 
interpreted by considering the contribution of the vari- 
ables to the components. For the score-based model (1) 
this is based on the fact that the loadings are equal to 
the correlation of the variables with the components 
(we suppose the variables to be mean-centered and 
scaled to norm one). Let be the /th variable in data 
block k and t rk the rth component for block k, then 

r (Xjfe, t,-fe) = pj,k, (3) 

with r(.,.) used as a notation for correlation and pj r k 
the loading of the y'th variable on the rth component of 
block k. In the weight-based model (2), interpretation of 
the components is based on the weights as these express 
each component as a weighted linear combination of the 
variables, 

t r fe = XfeWrfe. (4) 

For both model formulations this implies that for each 
component a total of J k correlations or weights have to 
be taken into account in the interpretation. Especially in 
the case of omics data, that usually consist of thousands 
of variables, there is a need for methods that facilitate 
interpretation. To that end, [14] proposed a sparse 
PCA method for the weight based model (2), that 
shrinks a (large) number of component weights to zero. 
Their method is based on a least-squares approach to 
PCA model (2) in which the objective function is aug- 
mented with an li penalty (also named lasso) and an l\ 
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(ridge) penalty: Minimize with respect to and P^- 
L(Wfe, P fc ) = 

|| Xk - XfeWfeP^f 2 + XtllWkH, + X R ||Wfe]|| , 

such that P^Pfe = I and with Xj, > 0 and X# > 0 tuning 
parameters for the lasso and ridge penalties respectively, 
II Will ! = E A ,r and l|W fe || 2 = I^r^r. The /asso, 

tuned by the parameter X L , has the property to simulta- 
neously shrink coefficients and select variables, keeping 
only those variables with the highest coefficients. The 
higher X L , the stronger the shrinkage and selection. 
Note that the selection is done in an unstructured way 
meaning that correlations between variables are not 
taken into account. The ridge penalty, tuned by 'Kr, only 
shrinks the coefficients and does not perform variable 
selection (none of the coefficients becomes zero). It is 
often introduced when it is of interest to group corre- 
lated variables [10] or in case of ill-conditioned optimi- 
zation problems (see [18]) to solve the non-uniqueness 
of the parameter estimates. A particular case is regres- 
sion analysis with more variables than objects, J k >I k , 
which yields an under determined estimation problem. 
In the context of PCA, this is of relevance for model (5) 
because the estimation of the component weights boils 
down to a regression analysis. Adding the ridge penalty 
with k R > 0 solves the non-uniqueness; in addition, with 
the appropriate normalization, the ridge ensures that the 
solution of (5) yields the principal components in case 
1 L = 0 (see [14]). 

The simultaneous component decomposition of K 
coupled data blocks having a common set of samples 
(so I\ = ... = Ik = I) is given by imposing the constraint 
that all are equal. Applied to the score based model 
this gives: 

X fe = JPl + E fe , (6) 

for all k and under the constraints of a principal axes 
orientation and orthogonality of the component scores: 
T T = I. Applying the idea of a common matrix of 
component scores to the weight based model as used by 
[14], can be realized as follows, 

[Xi...Xjd = 

[X 1 ...X J dK...w£f[P[...P£] (7) 
+ [Ei . . . E,f ] 

= T[P[...P£] + [E 1 ...E J d, (8) 

under the constraint of a principal axes orientation 
and orthogonal loadings: [P[ . . . P£] [P[ . . . P£] T = I. 



Simultaneous component model (7) shows that the 
common component scores T lie in the space spanned 
by all variables, this is from all data blocks. For ease of 
notation, we will use the shorthand notation X c = [X x ... 

X K ] (of size / x Y k J k ) and P c = [p[ . . . p£] T and 

W c = [W[ . . . W£] T (both of size I. k J k x R). Note that 

several simultaneous component models were proposed 
in the literature: [6] gives an overview that emphasizes 
the different ways of weighting the data blocks in con- 
nection to different principles to realize a fair integra- 
tion of the data. 

The problem that a lot of variables have to be taken 
into account when interpreting the components is exa- 
cerbated in the case of simultaneous component analysis 
as this involves several blocks of variables. To solve for 
this problem, we propose to go for a sparse simulta- 
neous component method by penalizing either the 
loadings (in the context of the score based model) or 
the component weights (in the context of the weights 
based model) within a least-squares approach. One pos- 
sibility, in line with sparse PCA, is to use the lasso pen- 
alty if necessary in conjunction with a ridge penalty 
(when grouping of correlated variables is of interest or 
when T. k J k >I). However, other types of penalties can be 
used that, when selecting variables, explicitly take into 
account that variables belong to (pre-defined) groups/ 
blocks by selecting variables within blocks only, between 
blocks only (by setting all weights/loadings of an entire 
block to zero, i.e. dropping an entire group of variables 
at once), or both within and between blocks. A penalty 
that introduces selection only within each group is Eli- 
tist lasso (mixed / li2 norm), defined for the rth compo- 
nent as 

*eX!ii w *iii,2=*eX! ($Ik*i ) • w 

k k \ jk ) 

Elitist lasso was introduced by [19] in the context of 
regression analysis. The behavior of this penalty can be 
understood by observing that it behaves as the lasso 
within blocks and as the ridge between blocks, resulting 
in shrinkage and a selection of the variables with the 
highest coefficients within each block (lasso) and a 
shrinkage but with no selection between blocks (ridge). 

To select entire (pre-defined) groups of variables, the 
group lasso [20] was introduced. It uses the Euclidean 
norm (also known as a mixed l2,\ norm; see [19]) of the 
group coefficients as a penalty, 

>-gX!v^IIw*|| 2 = a g J] hJ2( w fkr*)- (10) 
k k y jk 
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This penalty behaves as the lasso at the block level 
and as the ridge within blocks: within blocks shrinkage 
and grouping of correlated variables occurs however 
with no selection (behavior of the ridge penalty); 
between blocks selection of those blocks with the high- 
est sum of squared coefficients occurs while other 
blocks are dropped (behavior of the lasso). The group 
lasso applied to groups consisting of one variable only is 
the same as the lasso. (Note that taking the square root 
of a squared value is the same as taking the absolute 
value.) To obtain also sparsity within the groups that 
are not dropped by the group lasso, [21] proposed the 
sparse group lasso that blends the lasso with the group 
lasso and implies shrinkage and selection both within 
and between groups. The behavior of each of the four 
penalties and associated norms is summarized in Table 
1. 

We propose the following generic functions that 
combine all penalties: First, for the approach based on 
sparse component weights, 

I(W fe , P fe ) = 

£ (|x fe - x fe w fe p[|| 2 + XiHWfeiu) 

k 

+ J2( Xr H W fell2+^GV^I|Wfe|| 2 ) 

ft 

ft 

= |X C -X C W C P[|| 2 + ^ L ||W C || 1 

||w c ||| + (^cVhWkh) 
k 

ft 

which has to be minimized with respect to W,t and P c 
under the constraint that PjP c = I. Second, for the 
approach based on sparse component loadings, 

L (T, P fe ) = I X, - TP[ 1 2 + X L || P c || ! + X R || P c || I 

+ £(A G V^||Pfe|| 2 + Wftlka), 

ft 



which has to be minimized with respect to T and P^ 
under the constraint that T T = I. Note that estimation 
of the loadings is not a regression problem. Therefore, 
unlike the model based on sparse weights, unique solu- 
tions are obtained when J k >I. This is the case even 
when X R = 0. 

The generic loss functions (11) and (12) allow for a 
flexible use of all these approaches to sparseness. All 
combinations of the four penalties are made possible. 
However, often some prior idea about the structure 
(selection within blocks, between blocks, both within 
and between blocks) exists such that it is not necessary 
to consider all possible combinations. Furthermore, 
some combinations are not advisable. For example the 
combination of the group lasso and elitist lasso does not 
seem useful because the behavior of the one interferes 
with the behavior of the other. By setting the appropri- 
ate tuning parameters in the objective functions to zero, 
particular known sparse approaches can be obtained. 
For example, with X G = X E = 0 the extension of sparse 
PCA to simultaneous component analysis is obtained 
and with X R = X E = 0 a sparse simultaneous component 
version of the sparse group lasso in linear regression is 
obtained. With all four tuning parameters set equal to 
zero, the ordinary simultaneous component analysis 
model results. K = 1 leads to principal component ana- 
lysis and setting X G = X E = 0 yields sparse PCA as pro- 
posed by [14]. In Table 1 a summary is given of these 
different existing sparse approaches in terms of which 
penalties are active. 
Algorithm 

Given fixed values for the different tuning parameters 
(Xi, X R , X G , and X E ) and a fixed number of components 
R, we make use of an alternating scheme to minimize 
(11) or (12) with respect to W c (or T) and P c : W c (or 
T) and P c are alternatingly updated, conditional on fixed 
values for the other parameters. For example, focusing 
on (11): 

• Step 1: Initialize W c 

♦ Step 2: Conditional on the current estimate of W c , 
obtain the optimal least-squares estimate of P c 
under the orthogonality constraint as follows (see 



Table 1 Sparse approaches 



Norm 


Properties 






Sparse approach 








Lasso 


Elastic net 


Group lasso 


Sparse group lasso 


Elitist lasso 




selection and shrinkage at the level of the concatenated data 


YES 


YES 


NO 


YES 


NO 


% 


shrinkage, groups correlated variables 


NO 


YES 


NO 


NO 


NO 


'2.1 


selection and shrinkage of entire blocks 


NO 


NO 


YES 


YES 


NO 




selection and shrinkage within each block 


NO 


NO 


NO 


NO 


YES 



Different norms used in the context of sparse approaches, their properties, and specific sparse approaches based on particular combinations of the penalties. A 
'YES' indicates that the norm is active in the approach. 
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[22]): P c = UV r with USV r the singular value 
decomposition of WjXjX; 

♦ Step 3: Check the stop criteria: 1) Is the difference 
in loss with the previous iteration smaller than le - 
12 or, 2) is a maximum of 5000 iterations reached? 
If yes, terminate, and else continue. 

• Step 4: Conditional on the current estimate of P c , 
obtain the update of W c using a majorization mini- 
mization procedure (see [23-25] for a general intro- 
duction); see the Methods Section for a derivation of 
the estimate. Return to Step 2. 

This particular scheme guarantees that the loss is a 
non-increasing function of the iterations. Due to the 
convexity (not strict) and the fact that the loss function 
is bounded from below by zero, the procedure will con- 
verge to a fixed point for suitable starting values. The 
majorization minimization (MM) procedure has a linear 
rate of convergence; this slow convergence rate may, 
however, be compensated for by the efficiency of the 
calculations (see for example [26]). To account for the 
problem that the fixed point may represent a local mini- 
mum instead of the global optimum, a multistart proce- 
dure can be used. See the Methods Section for details 
on the algorithm used to minimize (12). MATLAB code 
implementing the algorithms can be found in the sup- 
plementary material. 

Testing and implementation 

In this section we apply the proposed approach both to 
empirical and simulated data. The application to empiri- 
cal data (metabolomics) is mainly for illustrative pur- 
poses. The simulated data are used to check how the 
different penalties (and their interactions) behave under 
various conditions, and to compare the sparse compo- 
nent weights and sparse component loadings modeling 
approaches. 
Metabolomics data 

As an illustrative case, we use empirical data on the 
metabolome composition of 28 samples of E. coli. The 
different samples refer to different environmental condi- 
tions and different elapsed fermentation times. Mass 
spectrometry (MS) in combination with on the one 
hand gas chromatography (GC) and on the other hand 
liquid chromatography (LC) as a separation method was 
used, resulting in two coupled data blocks: a GC-MS 
block with the peak areas of 144 metabolites in the 28 
conditions and a LC-MS block with the peak areas of 44 
metabolites in these same conditions. Simultaneous 
component analysis was previously successfully applied 
describing the data well by five components (see [5,6]). 
However, a better understanding of the processes under- 
lying the data may be obtained by a sparse simultaneous 



component analysis (SCA) approach as this charac- 
terizes the components by a few instead of all metabo- 
lites and thus facilitates interpretation. 

Our proposed method allows to model the data in 
several ways, depending on the one hand on the choice 
of penalizing either the weights or the loadings and on 
the other hand on the particular values of the different 
tuning parameters. Therefore, we will analyze the data 
under different options, namely either under model (11) 
or under model (12) and, for both models, with several 
combinations of values for the different tuning para- 
meters. Here we explain how we chose a suitable range 
of values for the tuning parameters using the notation 
for the model with penalized weights. The different 
values of X L , X G , X E , and X R were chosen in a way that 
reflects the balance between lack-of-fit and strength of 
the penalty by setting them as a fraction of | |X C | | (max- 
imal lack-of-fit) and |W C | M with W c obtained from the 
ordinary SCA solution (maximal value of the penalty). 
Let X p 9 denote the tuning parameter of the penalty cor- 
responding to the (mixed) l Piq norm, then this yields X p q 
= y||X c || 2 /|W c | M with /taking values 0,10' 4 ,10' 3 ,10" 2 ,l6" 
1 ,0.2, 0.5, and 1. We only consider those combinations 
of non-zero values for the tuning parameters that were 
considered in the regression literature, namely the lasso, 
elastic net, group lasso, sparse group lasso, and elitist 
lasso (see Table 1). Note that the case with all tuning 
parameters equal to zero corresponds to regular simul- 
taneous component analysis. 

First we discuss the results for the approach based on 
penalized weights, then the approach based on penalized 
loadings, followed by a brief comparison of the two 
approaches. We end the empirical application section 
with a discussion on the choice and interpretation of a 
particular sparse simultaneous component analysis. 
Penalized weights 

Table 2 summarizes the results for the approach with a 
penalty on the component weights and with only one of 
the tuning parameters different from zero (the ridge 
penalty on its own is not considered as it does not 
induce sparsity). Five components are assumed (R = 5). 
For the three resulting types of sparse simultaneous 
component analyses we report on the one hand the fit 
of the model to the data and on the other hand the per- 
centage of component weights that are zero. The fit is 

defined as 1 - Ix. - X^P^^/HXd! 2 . As could be 
expected, it holds that increasing the tuning parameter 
results in a decrease of fit and an increase of the pro- 
portion of zero component weights. Comparing the 
lasso and Elitist lasso, we see that the lasso has a better 
fit for a similar proportion of zeros which may be attrib- 
uted to the fact the lasso is less constrained because it 
does not have to reflect the block structure in the 
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Table 2 Summary results for the different simultaneous 
component analyses with sparse weights 







Lasso 


GroupLasso 


ElitistLasso 


f 


Fit 


% zeros 


Fit 


% zeros 


Fit 


% zeros 


0 


0.57 


0 


0.57 


0 


0.57 


0 


0.0001 


0.57 


86 


0.57 


0 


0.57 


88 


0.001 


0.57 


87 


0.57 


9 


0.56 


92 


0.01 


0.57 


88 


0.57 


9 


0.52 


96 


0.1 


0.56 


92 


0.56 


9 


0.26 


99 


0.2 


0.55 


94 


0.55 


25 


0.16 


97 


0.5 


0.52 


97 


0.47 


45 


0.08 


98 


1 


0.43 


99 


0.23 


50 


0.04 


99 



Different panels correspond to different approaches: The lasso in the left 
panel, the group lasso in the middle panel, and Elitist lasso in the right panel. 
Within each panel, both the fit of the model to the data and the percentage 
of zero weights are reported. The different rows correspond to different 
values of the tuning parameter. 



variable selection. Both for the lasso and Elitist lasso the 
proportion of zeros is very high, even for small values of 
the tuning parameter. This could be expected as the 
number of variables is larger than the number of sam- 
ples and warrants the inclusion of a ridge penalty (see 
also [14] for the case of the lasso), else non-unique solu- 
tions are obtained. Also, at most / non-zero weights will 
be obtained for each component and this may be too 



Elastic net 




0 1e-4 1e-3 0.01 0.1 0.2 0.5 1 

\ 



sparse, e.g. in the case of micro-array gene expression 
data obtained for a limited number of (tissue) samples. 
Such solutions with only / non-zero values fit as well as 
the regular simultaneous component model. To under- 
stand this, consider a model with one component: t = 
Xw which represents an underdetermined system in 
case I < ~Lk that can be solved exactly by taking only / 
non-zero weights. The group lasso operates at the level 
of the block and therefore does not show this effect. 
The effect of adding a ridge penalty to the lasso and eli- 
tist lasso is visualized in Figure 1: The higher the value 
of the parameter that tunes the ridge penalty, the lower 
the fit and the lower the proportion of zeros. In Figure 
2 the results for the sparse group lasso (i.e., combination 
of lasso and ridge penalty) are summarized. The lines 
express the fit and the proportion of zero weights in 
function of the lasso tuning parameter with different 
lines referring to different values of the group lasso. As 
illustrated by the figure, there is a qualitative interaction 
between the two types of penalties in the sense that 
lower values of the lasso parameter have a strong effect 
on the number of zeros when the group lasso parameter 
takes lower values while, conversely, higher values of the 
lasso parameter have a strong effect when the group 
lasso parameter takes higher values: As the group lasso 
shrinks the component weights, the penalty for the lasso 



Elitist with ridge 




0 1e-4 1e-3 0.01 0.1 0.2 0.5 1 



Figure 1 Adding the ridge panel to the (Elitist) lasso. Left panel: Elastic net; Right panel: Elitist lasso with ridge penalty. Fit (full lines) and 
proportion of zeros (dashed lines) in function of the lasso tuning parameter (left panel) and in function of the Elitist lasso tuning parameter 
(right panel). The different lines refer to different values of the ridge parameter. 
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( N 



Sparse group lasso 

^ 4 =[0,0.01] 




0 1e-4 1e-3 0.01 0.1 0.2 0.5 1 

\ 

Figure 2 Sparse group lasso. Fit (full lines) and proportion of 
zeros (dashed lines) for the sparse group lasso. The different lines 
refer to different values of the group lasso penalty. 



becomes lower and hence low values of the lasso tuning 
parameter are ineffective. Addition of a ridge penalty to 
the lasso and group lasso parameters may be considered 
when grouping is important (e.g., as is usual with gene 
expression data to find modules of co-expressed genes; 
[27]). 

Penalized loadings 

A summary of the results obtained for the approach 
with sparse loadings is given in Table 3. The general 
result that increasing the tuning parameters yields a 
decrease in fit and an increase in sparsity also holds 
here. A comparison between Tables 2 and 3 shows that 
for an equal proportion of zeros, the fit of models with 
sparse loadings is (much) lower. This can be understood 
from the fact that the loadings contribute more directly 
to the reconstruction of the data than the component 
weights (compare equations (1) and (2)): for example, in 
a model with one component, a zero loading results in a 
zero vector for the reconstructed variable. This also 
explains why different from the approach based on 
penalizing the weights with an L 1 penalty, the number 
of non-zeros is not bounded by /. Table 3 also shows 
that to obtain zero loadings with the group lasso, high 
values of the tuning parameter are needed. 



Table 3 Summary results for the different simultaneous 
component analyses with sparse loadings 







Lasso 


GroupLasso 


ElitistLasso 
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Fit 


% zeros 


Fit 


% zeros 


Fit 


% zeros 


0 


0.57 


0 


0.57 


0 


0.57 


0 


0.0001 


0.57 


0 


0.57 


0 


0.57 


0 


0.001 


0.57 


0 


0.57 


0 


0.57 


A 


0.01 


0.57 


0 


0.57 


0 


0.54 


19 


0.1 


0.57 


7 


0.57 


0 


0.36 


37 


0.2 


0.56 


10 


0.56 


0 


0.28 


41 


0.5 


0.53 


20 


0.54 


0 


0.17 


47 


1 


0.46 


28 


0.46 


0 


0.10 


47 



Different panels correspond to different approaches: The lasso in the first 
panel, the group lasso in the second panel, and Elitist lasso in the last panel. 
Within each panel, both the fit of the model to the data and the number of 
zero loadings are reported. The different rows correspond to different values 
of the tuning parameter. 



Reflections on penalizing the weights versus the loadings 

As illustrated, the results obtained under the model with 
penalized loadings are different from the results 
obtained under the model with penalized weights. In 
our view, the most important differences are at the level 
of data reconstruction and at the level of interpretation. 
With respect to data reconstruction, the model based 
on weights yields a better fit while the model with 
sparse loadings may yield many zero vectors for the 
reconstructed data. Also, in this respect, the compo- 
nents based on sparse weights have a higher correlation 
with the components of the ordinary SCA solution than 
the components resulting from a model with sparse 
loadings. With respect to interpretation of the underly- 
ing components, for the model based on sparse weights 
this is done in a regression-like way, while for the 
model based on sparse loadings it is based on consider- 
ing loadings as correlations of the variables with the 
component. In ordinary SCA, the loadings are the corre- 
lations and in the sparse model we observed a close 
connection in that zero loadings represent close to zero 
correlations and higher loadings represent higher corre- 
lations. The weights do not have such a relation with 
the correlation between the variable and the component. 
Selection and interpretation of the sparse SCA solution 
As has been illustrated in the previous Results Section, 
the data can be analyzed in many ways depending on 
choices made with respect to the generic model ((11) or 
(12)) and with respect to the values of the different tun- 
ing parameters. Selection of the appropriate model is of 
key importance and substantive issues may form a good 
point of departure. First, concerning the choice of the 
generic model, the model with penalized weights seems 
more appropriate for the data at hand because all meta- 
bolites can be considered to be involved in the biological 
processes underlying the data. For applications of 
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component models with sparse loadings to microarray 
gene expression data, see [28] and [13]. Second, to 
choose appropriate values for the tuning parameters we 
consider the properties of the associated penalties. Hav- 
ing components for which the interpretation is tied 
exclusively to one type of analytical platform (corre- 
sponding to the block structure) is convenient. Also, 
because for each platform many metabolites result, spar- 
seness within each platform/block is needed. This means 
that we are interested in selection both across and 
within groups. Recently, there has been a growing inter- 
est for methods that perform such a selection [29,30], 
with particular interest for the group lasso that has been 
extended and applied in several ways [31-33]. Therefore, 
we will restrict ourselves to a group lasso type of simul- 
taneous component model, however, including a ridge 
penalty to account for the fact that grouping is useful 
(because, within each analytical method, the metabolites 
belong to several classes of strongly related compounds): 
X L > 0, X R > 0, ~k G > 0, and X L = 0. Then, we eliminate 
solutions that 1) yield components with all weights 
equal to zero, 2) yield components having non-zero 
weights for both data blocks, and 3) solutions that do 
not fit well (fit < .40) or that are not sparse (less than 
50 percent of zero weights in a block). The remaining 
solutions are summarized in Table 4 in terms of the fit 
and the number of zeros per component. The solutions 
in bold, with high values for the lasso tuning parameter 
(fi = 0.5 or 1) and low values for both the ridge and 
group lasso parameters (fa = 0.0001 and/e = 0.1; orfg = 
0.001 and f G = 0.001), show the best tradeoff between fit 



Table 4 Overview of fit and sparseness for retained 



sparse group lasso models 
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Fit 




Number of zeros in 












C1 


C2 


C3 


C4 


C5 


0.5 


0.0001 


0.01 


0.52 


178 


176 


176 


178 


179 
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0.0001 


0.1 


0.49 


166 
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159 


0.5 


0.0001 
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0.44 


150 


173 


145 
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169 


0.5 


0.001 


0.1 


0.49 


158 
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166 


156 


0.5 


0.001 


0.2 


0.44 


169 


150 


146 


145 


173 


0.5 


0.01 


0.1 


0.48 


154 


154 


166 


165 


160 
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0.0001 


0.01 


0.43 


184 


181 


183 


184 


182 
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0.001 


0.001 


0.43 


181 


185 


185 


185 


186 


1 


0.001 


0.01 


0.43 


181 


182 


184 


183 


180 


1 


0.01 


0.0001 


0.42 


180 


183 


180 


179 


174 


1 


0.0 1 


0.001 


0.42 


182 


179 


174 


180 


179 


1 


0.01 


0.01 


0.41 


177 


180 


173 


178 


181 



Solutions with five components that have non-zero component weights in 
only one data block, a fit > .40, and more than 50 percent of zero weights in 
the remaining block. The strength of the different tuning parameters is 
indicated in the first three columns, the fit is displayed in the fourth column, 
and the 5 remaining columns show for each component (C1-C5) how many of 
the 188 metabolites received a zero-weight. 



and sparsity. We select these solutions for an 
interpretation. 

The metabolites with non-zero component weights are 
displayed for both selected solutions in Table 5; Table 6 
contains the component scores corresponding to the 
weights of the solution with/^ = 0.5. Observe that the 
solution with^ = 1 is a further selection of the metabo- 
lites in the solution with^ = 0.5. The first component 
shows an effect of phenyllactate, 3,5-dihydroxypentano- 
ate, and two aromatic amino acids (phenylalanine and 
tyrosine), together with two branched-chain amino acids 
(isoleucine and valine); the corresponding component 
scores (see CI in Table 6) show a clear increasing linear 
effect of fermentation time. The second component is 
made up by metabolites like fumarate, malate, aspartate 
and are associated to succinate catabolism (see C2 in 
Table 6) making biological sense as these metabolites 
are close to succinate in central metabolism. For C3, we 
find non-zero weights for a large number of (unidenti- 
fied) disaccharides and pyruvate and lactate and high 
scores in the oxygen related conditions. The identifica- 
tion of pyruvate and lactate could be indicative of a 
changing, i.e. reduced, dissolved oxygen concentration 
in the course of the fermentation as pyruvate can be 
converted into lactate during anaerobic growth. The 
fourth component is made up by nucleotides important 
for the energy metabolism in a cell (i.e. ADP, GDP, 
UDP) and is associated to the growth condition with an 
elevated pH at the early (16hrs) phase. Finally, C5 seems 
specific for the wild type strain, although the relation to 
the metabolites guanine and thymine (both nucleobases) 
and the other metabolites is not very clear. 
Simulated data 

To validate the proposed sparse simultaneous compo- 
nent method, we make use of simulated data. The gen- 
eral setup is that data are generated under some 
specific conditions and with known structure; after addi- 
tion of noise, the performance of the method in terms 
of recovering the underlying structure is assessed. Here, 
we are particularly interested in two aspects: A first one 
is whether the penalties reflect the structure in the 
selection of the variables (i.e., between data blocks; 
within data blocks; or both between and within data 
blocks); a second one is the behavior of the method in 
function of the model (i.e., sparse weights or sparse 
loadings). We also manipulated the amount of error in 
the data (5 and 30 percent) and the degree of sparseness 
(50 and 90 percent of zero weights/loadings). All factors 
were fully crossed and for each of the resulting 2 x 3 x 
2 x 2 = 24 conditions, 5 data sets were generated, 
resulting in a total of 120 data sets. To obtain a realistic 
simulation, we generated the data using the metabolo- 
mics data described in the previous section. 28 samples 
were sampled with replacement from the original data; 
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Table 5 Metabolites with non-zero weights in the two 
selected solutions 

metabolite f L = 0.5 f L = 1 

C1 3,5-dihydroxypentanoate: 0.68 0.88 

CI valine: 0.58 0.20 

CI 3-phenyllactate or isomer: 0.55 1.21 

CI isoleucine: 0.48 

C1 tyrosine: 0.41 0.03 



0.36 
0.25 
1.99 
1.06 



0.19 



2.18 
0.39 
0. 1 I 



1.01 
1.21 
0.14 



2.0-1 
0.34 



Metabolites with non-zero component weighs for each of the five 
components (CI to C5). The component weights of two selected models are 
shown that differ in the degree of sparsity (f L = 0.5 and f L = 1). 



Table 6 Component scores for the selected solution 
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24 
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-0.19 




40 


0.06 


1.21 


-0.13 
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-0.08 




48 


0.12 


1.07 
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Wild type 


16 


-0.42 


-0.27 


-0.34 


-0.20 


-0.05 




24 


-0.23 
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0.38 


0.44 
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48 
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Component scores for each of the five components (C1-C5). The samples 
where obtained in a specific environmental condition (first column) and at a 
particular fermentation time (second column). 



then a singular value decomposition was performed to 
obtain three components: the three loading and weight 
vectors were obtained as the three right singular vectors 
corresponding to the three largest singular values and 
multiplied by these, the three component score vectors 
were set equal to the corresponding left singular vectors. 
Sparseness was imposed by setting either weights or 
loadings equal to zero as follows: In case of sparseness 
between blocks, all weights/loadings of the first compo- 
nent that correspond to the first data block (the first 
144 weights/loadings) were set equal to zero and for the 
second and third component the weights/loadings corre- 
sponding to the second data block (the last 44 weights/ 
loadings) were set equal to zero; in case of sparseness 
within blocks, 50 or 90 percent of variable indices were 
randomly sampled and their corresponding weights/ 
loadings were set equal to zero; in case of sparseness 
within and between data blocks, the two previous 



C1 3,5-dihydroxypentanoate: 0.68 

C1 valine: 0.58 

CI 3-phenyllactate or isomer: 0.55 

CI isoleucine: 0.48 

CI tyrosine: 0.41 

C1 phenylalanine: 0.40 

C1 unknown mass 304, 319 and 406: 0.01 

CI spectrum not complete6: -0.06 

C1 mixed spectrum3: -0.43 

C1 keto-gluconate (?): -0.46 

C2 fumarate: 1.40 

C2 malate: 0.96 

C2 aspartate: 0.42 

C2 monomethylphosphate: 0.39 

C2 CI 8:1 fatty acid3: 0.37 

C2 unknownl: 0.37 

C2 spectrum not complete4: 0.20 

C2 mixed spectrum2: 0.19 

C2 glycerate: 0.14 

C2 unknown20: 0.02 

C3 lactate: 1 .23 

C3 pyruvate: 0.71 

C3 disaccharide12: 0.49 

C3 3-dehydroquinate: 0.38 

C3 disaccharide8: 0.33 

C3 citrate: 0.29 

C3 disaccharide9: 0.27 

C3 unknown mass 318 and 420: 0.17 

C3 unknown mass 217 and 191: 0.17 

C3 disaccharide13: 0.11 

C3 2-hydroxybutanoate: 0.09 

C4 ADP: 1.16 

C4 GDP: 0.96 

C4 UDP-glucose: 0.71 

C4 UTP: 0.34 

C4 unknown27: 0.20 

C4 GMP: 0.20 

C4 FBP: 0.09 

C5 spectrum not found7: 1.41 

C5 guanine: 0.73 

C5 orotate: 0.51 

C5 spectrum not complete5: 0.31 

C5 mixed spectrum6: 0.24 

C5 N-acetylaspartate 

C5 + beta-phenyl pyruvate: 0.23 

C5 thymine: 0.12 
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strategies were combined. The resulting component 
loadings and weights were used to generate the true 
data part using the model part of expressions (1) and 
(2) (i.e., without the addition of the residual matrices). 
Noise was then added to this true part of the data with 
the noise being generated from a normal distribution 
with mean zero and variance such that these residual 
matrices account for 5 or 30 percent of the total varia- 
tion [34]. Each of the data sets was analyzed under both 
models (sparse weights or sparse loadings) and with 
varying values for the tuning parameters (f equal to 0, 



10~ 3 , 0.1, 0.5, and 10). The Elitist lasso penalty was only 
combined with the ridge penalty because it interferes 
with the lasso and group lasso (see earlier). 

In the discussion of the results of the simulation 
study, we first focus on the conditions where the data 
are generated and analyzed under the same model 
(either sparse weights or sparse loadings), the error 
amounting to 30 percent of the total variation in the 
data, and the ridge penalty set equal to the smallest 
non-zero value. Figures 3 and 4 display boxplots of the 
proportion of variables correctly classified (selected 
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Figure 3 Boxplots of the proportion of recovered variables: 50 percent of true zeros. Boxplots of the proportion of variables correctly 
classified (selected versus dropped) in function of the value of the tuning parameters. Case with 50 percent of the variables dropped. The different 
panels refer to the different combinations of structure in the variable selection (from top to bottom: within blocks, between blocks, within and 
between blocks) and of sparseness approach (from left to right: lasso, group lasso, Elitist lasso, and sparse group lasso). The panels referring to the 
sparse group lasso are with varying values for the lasso tuning parameter and with the group lasso tuning parameter fixed at f = 10. 
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Figure 4 Boxplots of the proportion of recovered variables: 90 percent of true zeros. Boxplots of the proportion of variables correctly 
classified (selected versus dropped) in function of the value of the tuning parameters. Case with 90 percent of the variables dropped. The different 
panels refer to the different combinations of structure in the variable selection (from top to bottom: within blocks, between blocks, within and 
between blocks) and of sparseness approach (from left to right: lasso, group lasso, Elitist lasso, and sparse group lasso). The panels referring to the 
sparse group lasso are with varying values for the lasso tuning parameter and with the group lasso tuning parameter fixed at f= 10. 



versus dropped) in function of the value of the tuning 
parameter. Figure 3 refers to the case with 50 percent 
zero weights/loadings, Figure 4 to the case with 90 per- 
cent zero weights/loadings. In each Figure, the different 
panels refer to the different combinations of structure in 
the variable selection (from top to bottom: within 



blocks, between blocks, within and between blocks) and 
of sparseness approach (from left to right: lasso, group 
lasso, Elitist lasso, and sparse group lasso). The panels 
referring to the sparse group lasso are with varying 
values for the lasso tuning parameter and with the 
group lasso tuning parameter fixed at f G = 10. In 
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general, the results confirm the expected relation 
between the structure of the variable selection and the 
different approaches to sparseness: The best recovery 
for selection within blocks is by Elitist lasso with a value 
of 0.5 for the tuning parameter f E , for selection between 
blocks is by the group lasso with f G = 10, and for selec- 
tion between and within blocks the sparse group lasso 
(fi = 0.1 for the lasso). Deviations from the expected 
behavior occur for the sparse group lasso when selection 
is both within and between blocks in case of many zeros 
(see Figure 4): the lasso and Elitist lasso then outper- 
form the group lasso. This can be attributed to the fact 
that the group lasso is less aggressive than the lasso and 
Elitist lasso [11]. On the other hand, the lasso and Elitist 
lasso perform less well when selection is within blocks 
and the true structure is not so sparse (50 percent of 
zeros, see the top row of Figure 3) because of their 
aggressive behavior. Note that a penalty that selects 
between groups in a more aggressive way was proposed 
by [11]. The same pattern of results is obtained when 
the error amounts to 5 percent (though shifted upwards 
as in these conditions the status of the variables is better 
recovered) or when the tuning parameter of the ridge 
penalty takes higher values. In case the ridge equals 
zero, the box plots show worse results for the lasso and 
Elitist tuning parameters equal to zero (because there 
are more variables than objects thus at most 28 non- 
zero values are obtained for the approach based on 
sparse weights). 

A second point of interest, is the influence of the model 
used to generate and analyze the data. Figure 5 displays 
four panels of boxplots for the proportion of correctly 
classified variables. Within panels, the boxplots are dis- 
played in function of the block structure present in the 
variable selection. The upper panels refer to data gener- 
ated under a model with sparse loadings, the lower 
panels to data generated under a model with sparse 
weights. The panels at the left were obtained when ana- 
lyzing the data with a model based on sparse weights and 
at the right with sparse loadings. In general, analyzing the 
data with the sparse weights model yields less misclassifi- 
cations than using the sparse loadings model. However, 
generating the (underlying) data under a model with 
sparse weights, in general, results in more misclassifica- 
tions than generating under a sparse loadings model. 
These results can be explained by the more direct rela- 
tion between the loadings and generated or modeled 
data: Generating the data with sparse loadings imposes a 
clearer structure than generating them with sparse 
weights; analyzing/modeling the data with sparse load- 
ings imposes a stronger structure on the modeled data 
than modeling them with sparse weights. This is because 
1) unlike a zero loading, a zero weight for a variable does 
not necessarily imply a modeled score of zero, because a 
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Figure 5 Comparing sparse weights versus sparse loadings. 

Boxplots for the proportion of correctly classified variables. Within 
panels, the boxplots are displayed in function of the block structure 
present in the variable selection. The upper panels refer to data 
generated under a model with sparse loadings, the lower panels to 
data generated under a model with sparse weights. The panels at 
the left were obtained when analyzing the data with a model 
based on sparse weights; the panels at the right with a model 
based on sparse loadings. 



zero weight for one variable can be compensated by non- 
zero weights for other variables, and 2) unlike shrinking 
the weights, shrinking the loadings results more directly 
in shrunken modeled scores. The latter can be explained 
by the dependence of the scale of the data, as modeled by 
PCA model (1), on the scale of the loadings (the model 
has orthonormal component scores). 

Discussion 

We proposed an extension of sparse PCA to the context 
of several data blocks, relying on a generic modeling fra- 
mework that allows either for sparse component weights 
or for sparse component loadings and that incorporates 
several approaches that were taken to sparsity in the 
regression literature (including the lasso, elastic net, 
group lasso, Elitist lasso, and sparse group lasso). A very 
flexible algorithm was developed that allows to analyze 
the data under a variety of approaches that take the 
structure of the data in different ways into account. It 
also allows for combinations of penalties that were not 
yet considered in the regression literature. 

The flexibility of the approach is important as often a 
particular kind of structure is needed from data integra- 
tion methods. The group lasso is a popular tool to find 
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structures that only involve one data block. This is for 
example relevant in comparative genomics when the 
focus is on divergence [35] or on tissue-specificity [36]. 
Elitist lasso, on the other hand, finds sparse structures 
that involve each of the data blocks. Not only is this of 
relevance in the aforementioned case of comparative 
genomics to find conserved processes, but also in a top- 
down systems biology approach. For example, to inte- 
grate microarray gene expression data and interaction 
data with the aim of finding transcription factors and 
their target genes [37]. 

Although the model and algorithm were proposed in 
the context of simultaneous component analysis, it can 
be easily translated to the context of principal compo- 
nent analysis and also of regression analysis. In fact the 
algorithm can be used as it is for PCA and the adapta- 
tion to regression analysis is a minor one. In the context 
of simultaneous component analysis, adaptations of the 
model (and algorithm) to a context that allows for dif- 
ferent values of the tuning parameter for each compo- 
nent and/or each block would be valuable. However, 
such an extension is not trivial. Moreover, the problem 
of selecting an optimal model becomes more difficult in 
that more parameters need to be tuned and this would 
make the choice of selecting appropriate values for the 
tuning parameter even more difficult than it already is. 
A major theoretic challenge for many sparse methods is 
to find a good method to select the value of the tuning 
parameters. 

Conclusions 

We offered a flexible and sparse framework for data 
integration based on simultaneous component methods. 
The method is flexible both with respect to the compo- 
nent model and with respect to the sparse structure 
imposed: Sparsity can be imposed either on the compo- 
nent weights or on the loadings, and can be imposed 
either within data blocks, across data blocks, or both 
within and across data blocks. As such, it allows to find 
structures exclusively tied to one data platform as well 
as structures that involve all data platforms. A penalty 
based approach is used that includes the lasso, the ridge 
penalty, the group lasso, and Elitist lasso. The method 
includes principal component analysis, sparse principal 
component analysis, and ordinary simultaneous compo- 
nent analysis as special cases. Real and simulated data 
were used to validate the method. We believe the 
method offers a very flexible and versatile tool for many 
data integration problems. 

Methods 

Here we derive the estimates used in the alternating 
least squares and iterative majorization algorithm. First, 
it is shown how the conditional estimates for the 



objective function relying on sparse component weights 
can be obtained and then for the objective function rely- 
ing on sparse loadings. 

Sparse component weights 

The generic objective function that we rely on to find a 
simultaneous component solution with sparse compo- 
nent weights is to minimize 

L(W k ,P k ) = 

(fx* - X fe W fe P[|| 2 + X L \\W k \\i + A* nwtiil) 

k 

+x;(wwiw fe ii 2 +A E iiWfcii 1 , 2 ) 

= \\Xc - x.w.pJH 2 + x L \\w c h + x R \\w c \\l 

+X!(Wyw fe ii 2 +A E iiWfciii, 2 ) 

k 

= tr [(X, - XcWcPcY & - X^PJ)] 

+AiIIw c || 1 + a.h||w c |i! (13) 

+x;( x Gv^iiw fe ii 2 +x E iiw fe ii 1 , 2 ) 

k 

= tr [Xjx, - 2X t T X c W c P c T + P.WjxJXcWcPj] 

+A,il|W e || 1 +A. H ||W e ||i 

+ ^(^cAllW, ; || 2 +A £ ||W fe || 1 , 2 ) 
k 

= trXjX, - 2trXjX c W c Pj + trPcWjxJXW^ 
+Ai||W c || 1 +A. H ||W c ||! 

+X! (w5ftliw fe || 2 + MlWfeii^), 

k 

with respect to W c and P c and under the constraint 
PjP c = I- \t> ^r> an d ^£ are considered to be known 
non negative constants. We use an alternating approach 
in which each set of parameters is updated in turn while 
keeping the remaining sets fixed on their last update. 
Let P c be the first set to be updated, conditionally upon 
fixed values for W c . Rewriting (13) gives 

L (W c , P c ) = h - 2trWjX c T X c P c 

T T T V / 

+trP*P c WjXjX c W e 

with 

h = XX + kllWJl! + X R \\WAl + J2 k (W/kllW k || 2 + A. B ||Wk|| ll2 ) 
the terms that are constant with respect to P c . Using 
pjp c = I yields 

L (W CI P c ) =k 2 - 2trW^XjX c P c (15) 

with fe 2 = ki +trW c r X c r X c W c . The minimization of (15) 
under the constraint of orthogonal loadings is equivalent 
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to the maximization of trWjXjXcPc under the same 
constraint. This is a problem with known closed form 
solution [22] 



P C =VU J 



(16) 



diagonal matrix containing the |w| rfe | ° n its diagonal. 
Second, we consider the k Group Lasso penalty terms 

/ \ 1/2 

^G|Wfc| 2 = X G { 2^ wf ir \ ■ A majorizing function for 



with U and V the left and right singular vectors of the root ls ( see t 39 ^ 



wJx^X, 

The minimization of (13) with respect to W c is not a 
standard problem due to the Lasso, Group Lasso, and 
Elitist Lasso penalties on W c . We will make use of a 
numerical procedure, known as Majorization Minimiza- 
tion (MM) or also Iterative Majorization, which has 
been proven to be a superior algorithmic strategy in reg- 
ularization problems [25,38]. Briefly stated, MM replaces 
functions that are complicated to minimize by surrogate 
functions that are easy to minimize, that lie on/above 
the original function, and that touch the original func- 
tion in the so-called supporting point. These properties 
lead to the sandwich inequality [23]. 

A useful property of majorizing functions is that a 
sum of terms can be majorized by majorizing the terms 
[39]. Therefore, a majorizing function can be obtained 
for (13) by finding a linear or quadratic majorizing func- 
tion for each of the penalty terms except the ridge. First 
we consider the Lasso penalty: 



T,hk,r,k k L \ W 1 



Applying the additivity 



property again, we need to find a majorizing function 
for \wj hrh \ . Such a function is [40] 



1 m 



w 



+ 2 PV| 



(17) 



with U^ kTk the current estimate of that was 



obtained in the previous iteration. This yields 

a' 1 
2 ^ 



jk,r,k jk,r,k \ \w. 



i 



a, 



+ — w, 

2 I ll! 



rk 



(18) 



^Vec(W c ) r D! Vec(W c ) + fe 3 , 



with the Vec notation indicating that the matrix is 
vectorized, with fc 3 = J2j t r k 1 r^fk ' and w 'th ^i a 




(19) 



E< 



h + y Vec(W c ) T D 2 Vec (W c ) , 



with fe 4 = -£ J2 k \J2 h , r (fh) ) ' and WitK ° 2 a 



-1/2 



diagonal matrix containing the \^*v J ° n 

its diagonal. Third, we majorize the Elitist Lasso penalty 
term X E \W k \ h2 = ^(£ ft , r \wj k r{f with the following 
quadratic function (see [39]), 



1/2 



k \jk.r ) 



IV 



(20) 



Jh,r 



= k 5 + * £ Vec(W c ) T D 3 Vec (W c ) , 
with D 3 a diagonal matrix containing the 

(Efcr |<r|) (|«&r|) ' 011 itS dia g° naL 

Combining (13) with the results (18), (19), and (20), 
we obtain the following majorizing function for (13): 
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I(W C ,P C ) = 

|| Xc - X^Pj 1 2 + Xi || W e || i + Xb || W c f 2 

+ X|( A Gll W fcll2+^IIWtll 1 , 2 ) 
fe 

- IVecfXJ-VecfXcWcPj)! 2 

+X i ||W < ;|| 1 +X B ||W C ||g 

+ ^( A Gl|W t || 2 +Ad|W l || 1 , 2 ) 
k 

< HVec (X.) - (P c 0 XJ Vec (W,)|| 2 

+Vec(W c ) :r (D ilip ) Vec (WJ + fc 
= Q (W c , P e ) , 



(21) 



with D 



Xr 



sup - —H l + yD 2 + X E D 3 + X R I, I, an identity 



matrix, and k = I< 3 + k 4 + k 5 . This function can be mini- 
mized with respect to W c by finding the value for which 
the partial derivative of (21) is zero. The partial deriva- 
tive equals 

9Q 
3W C 

-2(P C ® XJ T [Vec (X,) - (P c ® XJ Vec (Wc)] (22) 
+2 (D 51jp )Vec(W c ), 

and is equal to zero for 

Vec (W c ) = 

[D sup + (P c ® Xcf (P c ® X,)]" 1 (Pj ® Xc) 1 Vec (Xc) (23) 
= [D sup + I ® (xJXcJJ'Vec (X%P C ) , 

where the inverse is taken of a block-diagonal matrix. 
W c is obtained by rearranging Vec(W c ). Note that the 
second derivative is positive so (23) is a minimum of 
(21). In this equation, the penalty terms occur as diago- 
nal matrices that are summed together in the matrix 
D sup and with the variance-covariance matrix of the 
data; the resulting matrix is inverted and will be domi- 
nated by large values on the diagonal (yielding small 
values after inversion). This shows the behavior of the 
penalties: increasing the tuning parameters results in 
such large diagonal values; furthermore, the diagonal 
matrices themselves are inverse functions of the weights 
in the previous iteration of the algorithm such that 
small weights further enhance the shrinkage or selec- 
tion. Note that the matrix to be inverted in equation 
(23) is of the form D + A A with D a diagonal matrix; 
then, the following holds [41], 

(D + A T A) _1 = D- 1 - D ^(i + XD- 1 X T )" 1 AD- 1 (24) 
which may be useful when J k >I. 

Sparse loadings 

The generic objective function that we rely on to find a 
simultaneous component solution with sparse compo- 
nent weights is to minimize 



L(T,P fe ) = 

£(!*• 



■TP[|| 2 +l L ||P fe || 1+ A i; 



k\\ 2 



(25) 



+ £(AGV / ^l|Pfell 2 +A£||Pfe|| 1 , 2 

fe 

|X e -TPj| 2 +* L ||P e || 1 +*ji \\P C 

fe 

tr[(X c -TP?;) T (X c -TP^)] 
+*lI|P c |Ii+*k llPclli 

+ £(W/*HP*ll2+ Wklll,2 

fe 

tr [XjXc - 2XjTP^ + P c T T TPj] 
+X L \\Pc\\i+X R \\P C \\ 2 2 

+ £(WAI|Pfell2+A£||Pfe|| 1 , 2 

fe 

trXjXc - 2trX[TP[ + trP c TPj 
+X L \\Pch+X R \\P c \\j 

+ £(A G v / ^IIPfell2+A£l|Pfell 1 , 2 



with respect to T and P^ under the constraint that 
T r T = I. X L , X R , X G , and X E are considered to be known 
non negative constants. In case all tuning parameters 
are equal to zero, a regular simultaneous component 
analysis results and in that case the algorithm should be 
based on SVD of the concatenated data. We use an 
alternating approach in which each set of parameters is 
updated in turn while keeping the remaining sets fixed 
on their last update. Let T be the first set to be updated, 
conditionally upon fixed values for P c . Rewriting (25) 
gives 



L (T, P c ) = k 6 - 2tiX T JP T c 
with 



(26) 



fe 6 = tr[xjx,- + P c pj] + XiUPJi + X R \\Pa\\l + (teVlkWPkh + XtllPtllu) 

the terms that are constant with respect to T. Minimiz- 
ing function (26) is equivalent to maximizing trPjXjT 
with known closed form solution [22] 



T = VU 



T 



(27) 



with U and V the left and right singular vectors of 
Pj X r . 

Combining (25) with the results (18), (19), and (20) 
adapted to the case of loadings, we obtain the following 
majorizing function for (25): 
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L(T,P C ) = 

||X £ -TP C T || 2 + X L ||P C || 1 

llPjf + ^(*Gl|Pfell2 + Wfelll, 2 ) 

k 

= ||Vec (X,) - Vec (TPj) || 2 + Wdli 

l|P C || 2 + ^(^Gl|Pfcll2+^HPfclll,2) 

k 

< I Vec (P c ) - (1 0 T) Vec (Pj) || 2 
+Vec(P c ) T (D sup )Vec(P c ) + fe 
= Q(T,P C ), 

and the first derivative of Q(T, P c ) with respect to P c 
is equal to zero for 

Vec(P c T ) = 

[d sup + (I T 0 T) T (I T 0 T c )] _1 (I T 0 T) T Vec 
= [D^ + lJ-Vec^X,). 

Additional material 



Additional file 1: MATLAB code. The zip file SparseSCAzip contains 
four MATLAB functions and a script (test script.m) to illustrate the use of 
the main function (sparsesca weights. m). The main functions sparsesca 
weigths.m and sparsescaloadings.m implement the proposed sparse 
simultaneous component algorithms. 
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